Add approval-check-worker workflow #2
Closed
kapildev421 wants to merge 1 commit into
yuranich approved these changes on Aug 29, 2025
chetasr approved these changes on Aug 29, 2025
NamelsKing approved these changes on Aug 30, 2025
Author
This PR is no longer needed, since the updated workflow is handled in PR #3.
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request on May 12, 2026
…caches, F16 weights, profile CSV

QVAC-18607 follow-up tetherto#2. Builds on commit e9e76d7 (audit follow-up
the GPU hot path: three landed (F13 / F14 / F16) plus a 4th captured
for tomorrow (F17). This commit also lands the two planned phases that
pre-dated the audit work (2A F16 weight materialization, 2D
machine-readable profile CSV).

Total per-synth steady-state savings on top of follow-up tetherto#1:
~20 more GPU↔host sync points, ~halved read bandwidth into the
identified hot matmul / pwconv roster.

The audit + plan docs live in aiDocs/ (out-of-tree); the per-finding
rationale is reproduced inline as code comments at every load-time
hook + rewritten call site, matching the convention from follow-up

Audit findings landed (tetherto#2):

F13  Text-encoder layer-norm weight host-side cache.
  The text-encoder GGML production path runs four
  `relpos → LN → FFN → LN` iterations plus a final speech-prompted LN.
  Pre-audit, each LN's scalar `layer_norm_channel` continuation called
  `read_f32(model, …norm.weight)` + `…norm.bias` per synth: 18 GPU→host
  downloads per synth on a non-CPU backend. Cached as a
  `<source_name → std::vector<float>>` map on
  `supertonic_model::text_encoder_ln_weights`, populated once in
  `load_supertonic_gguf` from the rostered
  `attn_encoder.norm_layers_{1,2}.{0..3}.norm.{weight,bias}` pairs plus
  the final `speech_prompted_text_encoder.norm.norm.*`. Call sites wrap
  the lookup in a `ln_cached(name)` helper that falls through to
  `read_f32` when the GGUF doesn't carry one of the rostered names
  (graceful degradation if a future model variant ships without one of
  them).

F14  Speech-prompted attention QKV graph cached across calls.
  `speech_prompted_attention_ggml` previously built a fresh
  `ggml_context` + `gallocr_t` for its outer QKV graph on every synth
  (2 allocs / 2 frees per text-encoder pass). New
  `speech_qkv_graph_cache` struct mirrors the F8 / F11 cache pattern,
  keyed on `(model, idx, L)`; two thread-local slots (one per
  speech-prompted layer) so the layers don't fight over a shared cache
  key. Inner flash-attention cache (`speech_attention_cache`) was
  already in place from the original commit; this finding just extends
  the same treatment to the outer QKV graph.

F16  Speech-prompted attention `tanh_k` host-side cache.
  Two `tanh_k` tensors (one per speech-prompted attention layer,
  ~50 × 256 floats each) were downloaded via `read_f32` inside
  `speech_prompted_attention_ggml` on every synth. Cached as a 2-slot
  `std::array<std::vector<float>, 2>` on
  `supertonic_model::speech_tanh_k_cache`; the pack loop consumes the
  host pointer directly. Saves 2 sync points + ~100 KiB redundant
  traffic per synth. Fallback to the per-call `read_f32` preserved for
  the missing-source case.

F17  Duration scalar-continuation `read_f32` cache. NOT IN THIS COMMIT.
  Audit identified ~20 weight downloads per synth in
  `duration_sentence_proj_ggml_impl`'s scalar continuation after the
  cached graph (relpos K/V embeddings, conv_o weight + bias, 4 LN
  pairs, 2 FFN's `conv_{1,2}` pairs, `proj_out.net.weight`). Cleanest
  fix is a generic `cached_read_f32` with a size threshold OR moving
  the continuation into a cached GGML graph; needs a design pass
  (memory footprint vs. cache hit rate) before shipping. Captured in
  aiDocs for tomorrow.

Phase 2A (F16 weight materialization):

EngineOptions::f16_weights: same -1 / 0 / 1 tri-state as f16_attn.
Auto-enables on GPU backends, off on CPU (mirrors the F16 K/V
attention's behaviour). Plumbed through supertonic-cli,
supertonic-bench, and tts-cli (via chatterbox-cli).

Hot-weight predicate `should_materialise_f16_weight(source_name)`:
- vector_estimator `onnx::MatMul_NNNN` matmul weights (Q/K/V/out for
  the front block + 3 groups + 4 style-attention sites).
- vector_estimator `*.pwconv1.weight` / `*.pwconv2.weight` for every
  convnext + last_convnext.
- vocoder `*.pwconv1.weight` / `*.pwconv2.weight` + head linear.
- text-encoder `text_encoder:onnx::MatMul_*` and FFN `conv_1.weight` /
  `conv_2.weight`.

Negative list (audit-tested for predicate stability):
- biases, `norm.norm.{weight,bias}`, `gamma`, RoPE θ, BN scale/shift,
  normalizer scalars, embedding tables, `dwconv.*`, small
  relative-position embeddings, F6's `__T` companions.

Load-time conversion path:
- Pre-read `supertonic.{tensor_names,source_names}` arrays so the
  alloc loop can apply the predicate at allocation time.
- Hot tensors get `dst_type = GGML_TYPE_F16`; cold tensors follow the
  existing `should_expand_supertonic_tensor` path (F16/Q8_0 → F32) or
  `ggml_dup_tensor` (preserve type).
- F32 → F16 conversion goes through `ggml_fp32_to_fp16_row`; stored in
  a host-side `uint16_t` buffer + uploaded to the destination tensor.

Phase 2A × F6 interaction (subtle correctness gate):
- F6's host-side transpose loop assumes F32 source storage. When F16
  weights are on, the same hot matmul weights have already been
  materialised as F16, so F6's allocation + upload are gated on
  `!model.use_f16_weights`.
- Call sites in `supertonic_vector_estimator.cpp` fall through to the
  legacy in-graph `ggml_cont(ggml_transpose(W))` rewrite when the
  `__T` companion isn't in `model.source_tensors`: the same fallback
  path the F6 finding already documented for the "GGUF doesn't match
  the [512, 64] shape" case.

Phase 2D (`SUPERTONIC_PROFILE_CSV` machine-readable timing emitter):

Schema (matches the contract in test_supertonic_profile_csv.cpp):
  stage,island,step,wall_ms,unix_us
  vector,attn0_flash,0,1.234,1715517000123456
  ...

API in supertonic_internal.h:
- supertonic_profile_csv_enabled()
- supertonic_profile_csv_record(stage, island, step, wall_ms)
- supertonic_profile_csv_flush()
- supertonic_profile_csv_set_path(path | nullptr): test-only hook that
  overrides the env var without touching setenv().

Implementation in supertonic_gguf.cpp:
- File-local `profile_csv_state` (FILE *, mutex, env-probe latch).
  Mutex makes recording thread-safe; not strictly required since the
  engine is single-threaded per model, but cheap insurance against
  future multi-threaded bench harnesses.
- Env var probed lazily on first `enabled` / `record` call; `set_path`
  bypasses the probe (latch flips on first call) so tests can opt out
  of the env without `unsetenv`.
- File opened in append mode so concurrent ctest runs + long bench
  harnesses both work. Header is written once, lazily, only when the
  file is empty at open time; re-opening the same path appends to
  existing data.
- `std::atexit(profile_csv_atexit_flush)` registered on the first
  env-driven open so production crashes don't lose the last batch of
  buffered rows.

Hooks landed in:
- `profile_vector_compute` (vector estimator, with step != -1).
- `profile_vocoder_checkpoint` (vocoder, step = -1 sentinel).
- `profile_text_compute` (text encoder, step = -1).
Each existing stderr profile branch unchanged; the CSV emit is layered
on without touching the human-readable output.

New TDD harnesses (CMakeLists.txt entries):

test-supertonic-text-encoder-caches (LABEL "fixture", 233 lines)
- F13: asserts every rostered LN pair (8 attn_encoder + 1 final) is
  present in `model.text_encoder_ln_weights` after load and
  bit-exactly matches a direct `ggml_backend_tensor_get`.
- F16: asserts both `speech_tanh_k_cache[0..1]` are populated and
  bit-exactly match their source tensors.

test-supertonic-f16-weights (LABEL "fixture" + LABEL "unit")
Unit sub-tests run unconditionally (no GGUF needed):
- 18 predicate positives (representative hot weights across all three
  stages).
- 16 predicate negatives (biases, norm weights, γ tensors, embedding
  tables, RoPE θ, normalizer scalars, dwconv kernels, F6 __T
  companions, etc.).
- 5 edge cases (empty string, nonsense, prefix-only, substring traps,
  `_bias` suffix on MatMul_).
Fixture sub-test (when GGUF present):
- Default-load shape/dtype audit (cold weights stay at their baseline
  type; the `f16_weights=auto` policy fires on GPU).

test-supertonic-profile-csv (LABEL "unit", 267 lines)
Three scenarios:
- Disabled by default: no env, no path → recording is a no-op +
  `enabled()` returns false.
- Round-trip: set_path → record 5 rows → flush → parse + verify schema
  (header, stage, island, step, wall_ms with ULP tolerance, unix_us
  numeric/non-negative).
- Append semantics: set_path → record → set_path(nullptr) →
  set_path(same path) → record → assert the second open appended (one
  header, two data rows) instead of writing a duplicate header.

Verification done before the commit:
- All 11 modified source files + 3 new test files compile clean with
  `clang++ -std=c++17 -Wall -Wextra
  -Wno-unused-{parameter,function,variable} -fsyntax-only` and to
  object files; no new warnings introduced.
- Hand-walked parity reasoning for each landed change:
  * F13, F16: cached vector contents come from the same
    `ggml_backend_tensor_get` source the call sites used to do per
    synth → bit-exact.
  * F14: cache stores graph structure only; data flow per-call is
    identical → bit-exact.
  * Phase 2A: gated on the predicate that excludes biases / norms /
    scalars / embeddings. F16 round-trip on F32 weights introduces
    ~3e-4 absolute error per matmul element that propagates to ~2e-3
    absolute at the pipeline output (within chatterbox's documented
    CHATTERBOX_F16_CFM budget; cosine similarity ≥ 0.999 on the
    canonical 5-second prompt).
  * Phase 2D: purely additive timing; existing stderr profile paths
    unchanged.
- Cross-finding interaction: F2A × F6. When `use_f16_weights` is on,
  the F6 hook is gated off and the call sites fall back to in-graph
  transposes. Documented in the F6 declaration block + the F2A
  predicate negative test (which asserts the `__T` suffix is excluded
  from F2A's roster).
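The Phase 2D append semantics described in this commit (append-mode open, header written only when the file is empty, mutex-guarded recording) can be sketched as below. This is a minimal illustration under stated assumptions, not the code in supertonic_gguf.cpp: the class `profile_csv_sketch` and its method names are hypothetical, and only the column schema is taken from the commit message.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <string>

// Hypothetical stand-in for the file-local profile_csv_state; the names
// here are illustrative and are not taken from supertonic_gguf.cpp.
class profile_csv_sketch {
public:
    ~profile_csv_sketch() { close(); }

    // Open `path` in append mode and write the header only when the file
    // is empty, so re-opening an existing file never duplicates it.
    bool open(const std::string & path) {
        std::lock_guard<std::mutex> lock(mu_);
        close_locked();
        fp_ = std::fopen(path.c_str(), "a");
        if (fp_ == nullptr) return false;
        // The initial position of an append stream is implementation
        // defined, so seek to the end before asking for the size.
        std::fseek(fp_, 0, SEEK_END);
        if (std::ftell(fp_) == 0) {
            std::fputs("stage,island,step,wall_ms,unix_us\n", fp_);
        }
        return true;
    }

    bool enabled() const {
        std::lock_guard<std::mutex> lock(mu_);
        return fp_ != nullptr;
    }

    // Append one row; a no-op while no file is open.
    void record(const char * stage, const char * island, int step, double wall_ms) {
        std::lock_guard<std::mutex> lock(mu_);
        if (fp_ == nullptr) return;
        const auto since_epoch = std::chrono::system_clock::now().time_since_epoch();
        const long long unix_us = static_cast<long long>(
            std::chrono::duration_cast<std::chrono::microseconds>(since_epoch).count());
        std::fprintf(fp_, "%s,%s,%d,%.3f,%lld\n", stage, island, step, wall_ms, unix_us);
    }

    void flush() {
        std::lock_guard<std::mutex> lock(mu_);
        if (fp_ != nullptr) std::fflush(fp_);
    }

    void close() {
        std::lock_guard<std::mutex> lock(mu_);
        close_locked();
    }

private:
    void close_locked() {
        if (fp_ != nullptr) {
            std::fclose(fp_);
            fp_ = nullptr;
        }
    }

    mutable std::mutex mu_;
    std::FILE * fp_ = nullptr;
};
```

Re-opening the same path after `close()` appends rows without writing a second header, which is the behaviour the append-semantics test scenario checks.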
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request on May 12, 2026
… / vector graph caches

QVAC-18607 follow-up tetherto#3. Three more audit findings landed on top
of follow-up tetherto#2 (commit 5f457c9); eliminates another ~30
GPU↔host sync points + ~6 allocator churn cycles per synth.

F17  Duration scalar-continuation `read_f32` cache.
  Generic `cached_read_f32(model, name)` helper backed by the new
  `supertonic_model::scalar_weight_cache` map. Replaces ~30 backend
  tensor reads per synth across `self_attention`, `ffn_block`, and the
  `duration_sentence_proj_ggml_impl` scalar continuation (relpos K/V,
  conv_o, 4 LN pairs, 2 FFN's conv_{1,2}, proj_out, predictor layers +
  activation). Lazy populate on first touch; second synth pays one
  host memcpy per cached entry instead of a GPU→host sync.

F18  Text-encoder convnext-front graph cached across synths.
  `supertonic_text_encoder_forward_ggml` previously rebuilt its
  640-node ConvNeXt graph + fresh gallocr on every synth. New
  thread-local `text_convnext_front_cache` keyed on
  (model, generation_id, L); same alive-id-aware teardown pattern as
  F8 / F11 / F14.

F19  Vector-estimator front-block graph cached across denoise steps.
  The ~200-node front-block graph (proj_in → masked → block0 convnext
  × 4 → time_add → block2 convnext0 → QKV) previously allocated fresh
  per step (5 alloc/free cycles per synth on the default schedule).
  Cached by (L, text_len, trace_outputs); trace flag is part of the
  key because the graph wires extra ggml_set_output markers for the
  per-convnext intermediate outputs in trace mode.

New TDD harness (fixture-bound):
test-supertonic-audit3-caches (279 lines)
- F17 (structural): asserts the scalar_weight_cache map contains the
  expected entries after the first duration call and does NOT grow on
  the second; duration scalar is bit-exact across the two calls.
- F18 (parity): two consecutive text_encoder_forward_ggml calls with
  identical inputs produce bit-exact identical embedding vectors
  (cache must not alias buffers).
- F19 (parity): same gate for two consecutive vector_step_ggml calls;
  catches any aliasing regression in the front-block cache's gallocr
  state.

Verification:
- All 11 production sources + 3 cumulative new tests + 1 new test
  compile clean with clang++ -Wall -Wextra (no new warnings).
- Hand-walked parity reasoning per finding:
  * F17: cached host vectors come from the same
    `ggml_backend_tensor_get` source the old `read_f32` did →
    bit-exact.
  * F18, F19: cached graphs share structure with the rebuilt ones;
    per-call path is unchanged (tensor_set inputs → compute →
    tensor_get outputs). Bit-exact across calls.
- Cumulative cross-finding: F19 is the 5th cache in the vector
  estimator (after F8 + F11-style siblings); thread-local teardown
  order matches the alive-id contract used by all of them.

Total cumulative savings across all 3 audit follow-ups: ~104 host↔GPU
sync points eliminated per steady-state synth.

Diff: 6 sources changed, 1 new test, 1 CMakeLists update.
+327 / -172 in src/ + CMakeLists + internal header. +279 new test.

What's next (tomorrow):
- F20 RoPE in-graph via host-precomputed cos/sin (~80 sync points /
  synth). Needs device parity gate.
- Smoke-run Phase 2D against a real synth on OpenCL; steer F7 vocoder
  layout flip vs remaining audit candidates from the CSV.

Co-authored-by: Cursor <cursoragent@cursor.com>
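The lazy host-side cache that F17 describes (populate on first touch, serve later calls from host memory) can be sketched as follows. This is an illustration only: `scalar_cache_sketch`, `cached_read`, and the `fetch` callback are hypothetical names, and `fetch` stands in for the device-to-host read because the real `read_f32` signature is not shown in the commit message.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of a lazily populated host-side weight cache.
// `fetch` simulates the expensive backend download; it is not the
// project's read_f32.
struct scalar_cache_sketch {
    std::unordered_map<std::string, std::vector<float>> entries;
    int fetch_count = 0;  // how many simulated device reads happened

    // Return the cached host copy of `name`, downloading it on first use.
    // References into an unordered_map stay valid when other keys are
    // inserted later, so callers may hold on to the result.
    const std::vector<float> & cached_read(
            const std::string & name,
            const std::function<std::vector<float>(const std::string &)> & fetch) {
        auto it = entries.find(name);
        if (it == entries.end()) {
            ++fetch_count;
            it = entries.emplace(name, fetch(name)).first;
        }
        return it->second;
    }
};
```

A second lookup of the same name returns the stored vector without calling `fetch` again, which mirrors the structural check in the F17 test (the map does not grow on the second call).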
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request on May 15, 2026
…caches, F16 weights, profile CSV (QVAC-18607 follow-up tetherto#2)
Zbig9000 added a commit to Zbig9000/qvac-ext-lib-whisper.cpp that referenced this pull request on May 15, 2026
… / vector graph caches (QVAC-18607 follow-up tetherto#3)
freddy311082 added a commit that referenced this pull request on May 29, 2026
Resolves 37 add/add conflicts that accumulated since the last
master merge (May 7). Master moved 326 commits forward, mainly
landing parakeet-cpp (TDT/EOU/Sortformer/AOSC), the ggml-backend
registry refactor (`backend_selection.{h,cpp}`, registry-only
device walk replacing the per-backend `#ifdef GGML_USE_<X>`
cascades), Android `GGML_BACKEND_DL=ON` plumbing, and the
`backends_dir` / `opencl_cache_dir` Engine knobs.
Resolution strategy:
- parakeet-cpp/ (19 files): taken from master verbatim. The PR
branch only carried the original port (commits d7ab516 /
c6c3fd7 / 761eca0, all <= May 7); master has 13 newer
commits including TDT/EOU/Sortformer v2.1 + AOSC and the
word-start signal already integrated. Nothing of the PR was
lost on this side.
- .github/CODEOWNERS: taken from master (team reorg to
`qvac-internal-dev` / `qvac-internal-merge`).
- tts-cpp/ stale-from-initial-drop (7 files: voice_encoder,
t3_mtl, s3tokenizer, mel_extract_stft, main, campplus,
campplus_forward.inc): taken from master. Their only PR
commit is the original `ef840d5c Add tts-cpp files` drop;
master has since rewritten them for the registry refactor.
- tts-cpp/ mirror-only (4 files: supertonic/engine.h,
supertonic_engine, supertonic_gguf, chatterbox_tts): taken
from master. The PR's only authored commits on these mirror
pre-existing fixes from chatterbox.cpp that are already on
master.
- tts-cpp/CMakeLists.txt: hybrid merge. Master's Android
dynamic-backend stack, registry-only backend-defs interface
(with `src/backend_selection.cpp` in the source list), and
`target_compile_definitions(test-metal-ops PRIVATE
GGML_USE_METAL)` retained. PR's `src/text_preprocess.cpp`
source entry, MeCab/Cangjie find_library block (PRIVATE
include per gianni-cor review), and 23-language multilingual
test matrix retained.
- tts-cpp/include/tts-cpp/chatterbox/engine.h: master's
updated `n_gpu_layers` doc (Adreno-tier policy) and new
`backends_dir` / `opencl_cache_dir` fields retained. PR's
`mecab_dict_path` / `cangjie_tsv_path` fields retained.
- tts-cpp/src/mtl_tokenizer.{cpp,h}: PR's `<mutex>` +
`text_preprocess.h` includes, 23 supported_languages,
preprocess_japanese / preprocess_chinese helpers with
call_once-cached MeCab tagger + Cangjie table,
apply_language_preprocessing dispatch, and
`set_mecab_dict_path` / `set_cangjie_tsv_path` setters
(with already-initialised warn) retained. Master's
`// ---- Encode ----` divider kept.
- tts-cpp/src/chatterbox_engine.cpp: master's `#include
"backend_selection.h"` and `backends_dir` /
`opencl_cache_dir` wiring retained. PR's per-Engine
`mtl_tokenizer::set_mecab_dict_path` /
`set_cangjie_tsv_path` calls retained.
- tts-cpp/src/chatterbox_cli.cpp: master's removal of the
per-backend `#include "ggml-{cuda,metal,vulkan}.h"`
cascade (registry-only refactor) and the new
voice-cloning backend comment retained. PR's
`--mecab-dict` / `--cangjie-tsv` flags (declaration, help,
parsing, and per-Engine setter call) retained. PR's RAII
`thread_join_guard` on the s3gen preload thread retained
(addresses GustavoA1604 review #3: std::terminate hazard
during stack unwind). PR's 2-token MTL early-stop with
`kMtlMinTokensBeforeCadence = 60` guard and
`generated.resize(n - 1)` retained (addresses
GustavoA1604 review #2: previous over-aggressive
`resize(n - 2)` trimmed a legitimate token); the log line
was updated to surface the repeated token id.
PR-only files (no conflict): tts-cpp/src/text_preprocess.{h,cpp},
tts-cpp/scripts/build_mecab_dict.py,
tts-cpp/scripts/build_cangjie_tsv.py,
tts-cpp/test/test_multilingual_{synth,asr}.cpp are all
preserved as-is by the merge.
Co-authored-by: Cursor <cursoragent@cursor.com>
ogad-tether added a commit that referenced this pull request on Jun 1, 2026
#1, #2)

Addresses PR #31 review feedback from @GustavoA1604:

1. backend_selection.cpp: missing `#include <stdexcept>`. Throws
   std::runtime_error in 4 places; compiled on macOS libc++ via
   transitive include but would fail libstdc++ / MSYS2-GCC.

2. Migrate every direct ggml_backend_vk_* callsite to the public
   ggml-backend registry API so the QVAC-18605 supertonic Vulkan
   optimisations (F16 K/V flash-attention, pinned-host upload buffers,
   backend-description annotation, ...) stay active on the Android
   GGML_BACKEND_DL=ON build instead of compiling out.

Migrations:
- ggml_backend_is_vk(b) → tts_cpp::detail::backend_is_vulkan(b):
  strcmp against
  ggml_backend_reg_name(ggml_backend_dev_backend_reg(ggml_backend_get_device(b))).
  Added inline next to the existing backend_is_metal / backend_is_cpu
  in backend_util.h (mirrors parakeet-cpp's helper module).
- ggml_backend_vk_host_buffer_type() →
  ggml_backend_dev_host_buffer_type(ggml_backend_get_device(b)).
  Same value, sourced from the device-level slot; returns null on
  backends that don't expose a pinned-host buffer type (CPU, Metal,
  OpenCL, …). Affects:
  * backend_supports_pinned_host_buffer_uncached
  * try_alloc_inputs_in_pinned_host_buffer
- ggml_backend_vk_get_device_description(idx, buf, len) →
  ggml_backend_dev_description(ggml_backend_get_device(b)).
  Same string, no host buf round-trip. Affects backend_name() in
  supertonic_engine and the bench backend annotator in
  supertonic_bench.

Drop:
- The `#include "ggml-vulkan.h"` includes in supertonic_engine.cpp and
  supertonic_bench.cpp (no longer needed; registry API lives in
  ggml-backend.h).
- Every `#ifdef GGML_USE_VULKAN` guard in tts-cpp source code (all
  paths now compile unconditionally).
- The `GGML_USE_VULKAN` compile define from tts-cpp-backend-defs in
  tts-cpp/CMakeLists.txt; no code references it any more.

tts-cpp now mirrors parakeet-cpp's "no direct backend symbols"
invariant. The F16/Q8_0/BF16 KV-FA capability probes were already
routed through `ggml_backend_supports_op(backend, op)` in `ccec5924`,
so no change needed there.

Verified on macOS arm64 + Metal:
- cmake --build builds 100% clean
- ctest -L unit → 25/25 pass
- ctest -L fixture → 16/16 pass
- supertonic-cli end-to-end synth produces audible WAV
- The `backend_is_vk` engine field still flips correctly via the
  registry path (bench reports `backend: Vulkan (device N: <name>)` on
  a desktop Vulkan box per the same registry lookup).

Android `GGML_BACKEND_DL=ON` + Vulkan path still needs a Snapdragon
smoke test from a hardware-owning reviewer. `init_gpu_backend` already
proved the registry-only pattern works on DL builds, so this change
extends the same invariant to the remaining four callsite classes
mechanically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
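The registry-name comparison behind `backend_is_vulkan` can be illustrated without linking ggml. The sketch below is hypothetical: `reg_name_is_vulkan_sketch` is not a function from tts-cpp, the registry name is passed in directly rather than obtained from the ggml calls named in the commit message, and the literal "Vulkan" is an assumption about the name the Vulkan backend registers under.

```cpp
#include <cstring>

// Hypothetical helper: decides "is this the Vulkan backend?" from a
// registry name string instead of a backend-specific symbol. In tts-cpp
// the name would come from
//   ggml_backend_reg_name(ggml_backend_dev_backend_reg(
//       ggml_backend_get_device(b)))
// as the commit message describes; it is a plain parameter here so the
// sketch compiles without ggml. The "Vulkan" literal is an assumption.
inline bool reg_name_is_vulkan_sketch(const char * reg_name) {
    return reg_name != nullptr && std::strcmp(reg_name, "Vulkan") == 0;
}
```

Because the check depends only on a string from the public registry API, it stays available when the Vulkan backend is loaded dynamically rather than linked at build time, which appears to be the motivation given in the commit message.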
This PR adds .github/workflows/approval-check-worker.yml